
Conversation

@Matt711 Matt711 commented Apr 7, 2025

Description

Contributes to #16483.

This PR adds a new libcudf API: cudf::table_to_array, which copies data from a table_view into a preallocated column-major device array using cub::DeviceMemcpy::Batched.

The primary use case for this API is to accelerate the conversion of a cudf.DataFrame to a CuPy array when users access DataFrame.values in Python.

In a follow-up PR, I'll integrate this API into the cudf Python layer.
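To make the layout concrete, here is a minimal host-side sketch (NumPy standing in for device memory; `table_to_array_sketch` is a hypothetical helper, not the libcudf API) of what "copy a table into a preallocated column-major array" means: each equal-dtype column becomes one contiguous copy into its slice of the flat output buffer.

```python
import numpy as np

# Hypothetical host-side stand-in for cudf::table_to_array: copy equal-dtype
# columns into a caller-preallocated flat buffer in column-major order.
def table_to_array_sketch(columns, out):
    num_rows = len(columns[0])
    # Column-major: column j occupies out[j*num_rows : (j+1)*num_rows], so each
    # column is a single contiguous copy (the batched-memcpy unit on the GPU).
    for j, col in enumerate(columns):
        out[j * num_rows:(j + 1) * num_rows] = col
    # View the same storage as a Fortran-ordered 2-D array.
    return out.reshape(num_rows, len(columns), order="F")

cols = [np.array([1.0, 2.0, 3.0]), np.array([4.0, 5.0, 6.0])]
buf = np.empty(6, dtype=np.float64)  # preallocated by the caller, as the API requires
arr = table_to_array_sketch(cols, buf)
# arr[:, 0] is the first column, arr[:, 1] the second; arr shares buf's storage.
```

In the real API the output span lives on the device and the per-column copies are issued as one `cub::DeviceMemcpy::Batched` call rather than a loop.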

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.


copy-pr-bot bot commented Apr 7, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.


@github-actions github-actions bot added libcudf Affects libcudf (C++/CUDA) code. CMake CMake build issue labels Apr 7, 2025
@Matt711 Matt711 added feature request New feature or request non-breaking Non-breaking change labels Apr 7, 2025

Matt711 commented Apr 7, 2025

/ok to test

@Matt711 Matt711 marked this pull request as ready for review April 8, 2025 03:51
@Matt711 Matt711 requested review from a team as code owners April 8, 2025 03:51
@Matt711 Matt711 requested review from bdice and vuule April 8, 2025 03:51
@bdice bdice left a comment


Initial feedback attached. Thanks for your work on this!


Matt711 commented Apr 8, 2025

Benchmark Results

table_to_array

[0] NVIDIA RTX 5880 Ada Generation

| num_rows | columns | Samples | CPU Time | Noise | GPU Time | Noise | GlobalMem BW | BWUtil |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 262144 | 2 | 24544x | 24.381 us | 20.36% | 20.378 us | 5.97% | 205.824 GB/s | 21.44% |
| 2097152 | 2 | 8656x | 61.408 us | 6.84% | 57.816 us | 2.96% | 580.363 GB/s | 60.45% |
| 16777216 | 2 | 1424x | 357.499 us | 1.32% | 353.912 us | 0.85% | 758.480 GB/s | 79.00% |
| 32768 | 10 | 25488x | 23.607 us | 21.45% | 19.629 us | 6.31% | 133.547 GB/s | 13.91% |
| 262144 | 10 | 11792x | 46.189 us | 9.27% | 42.504 us | 3.45% | 493.399 GB/s | 51.39% |
| 2097152 | 10 | 2192x | 233.055 us | 1.91% | 229.399 us | 1.06% | 731.356 GB/s | 76.18% |
| 16777216 | 10 | 544x | 1.666 ms | 1.23% | 1.663 ms | 1.21% | 807.174 GB/s | 84.07% |
| 32768 | 100 | 9744x | 55.100 us | 8.09% | 51.326 us | 3.50% | 510.739 GB/s | 53.20% |
| 262144 | 100 | 1744x | 290.759 us | 1.57% | 287.065 us | 0.91% | 730.551 GB/s | 76.09% |
| 2097152 | 100 | 352x | 2.059 ms | 1.30% | 2.056 ms | 1.29% | 816.148 GB/s | 85.01% |
| 16777216 | 100 | 32x | 16.006 ms | 0.26% | 16.002 ms | 0.25% | 838.756 GB/s | 87.36% |

[1] NVIDIA RTX 5880 Ada Generation

| num_rows | columns | Samples | CPU Time | Noise | GPU Time | Noise | GlobalMem BW | BWUtil |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 32768 | 2 | 15344x | 36.213 us | 13.56% | 32.615 us | 8.08% | 16.075 GB/s | 1.67% |
| 262144 | 2 | 14064x | 39.508 us | 12.78% | 35.582 us | 6.11% | 117.878 GB/s | 12.28% |
| 2097152 | 2 | 6496x | 80.498 us | 4.96% | 77.064 us | 2.18% | 435.412 GB/s | 45.35% |
| 16777216 | 2 | 1648x | 546.025 us | 1.37% | 542.291 us | 1.18% | 495.003 GB/s | 51.56% |
| 32768 | 10 | 13840x | 39.317 us | 9.62% | 36.142 us | 4.01% | 72.532 GB/s | 7.55% |
| 262144 | 10 | 8544x | 61.967 us | 6.53% | 58.526 us | 2.85% | 358.326 GB/s | 37.32% |
| 2097152 | 10 | 2064x | 246.372 us | 2.02% | 242.862 us | 1.07% | 690.814 GB/s | 71.95% |
| 16777216 | 10 | 928x | 1.877 ms | 2.02% | 1.873 ms | 2.01% | 716.410 GB/s | 74.62% |
| 32768 | 100 | 7632x | 68.997 us | 5.91% | 65.532 us | 2.68% | 400.023 GB/s | 41.66% |
| 262144 | 100 | 1680x | 303.596 us | 1.48% | 300.142 us | 0.93% | 698.719 GB/s | 72.78% |
| 2097152 | 100 | 576x | 2.117 ms | 0.80% | 2.114 ms | 0.78% | 793.753 GB/s | 82.67% |
| 16777216 | 100 | 31x | 16.148 ms | 0.26% | 16.144 ms | 0.26% | 831.395 GB/s | 86.60% |

@Matt711 Matt711 requested review from bdice and vuule April 21, 2025 15:00
@Matt711 Matt711 added this to libcudf Apr 22, 2025
@Matt711 Matt711 moved this to In progress in libcudf Apr 22, 2025

ttnghia commented Apr 24, 2025

What is the common functionality between this and chunk_pack (#13260)? If they are similar, can the two be refactored to share implementation?


@bdice bdice left a comment


A few comments -- overall this is close!


bdice commented Apr 25, 2025

> What is the common functionality between this and chunk_pack (#13260)? If they are similar, can the two be refactored to share implementation?

I looked over the splitting/copying APIs. This is similar to cudf::pack but does not serialize any table metadata or null masks. It's just the raw contents of the columns (which must all have the same type), concatenated into an existing user-provided GPU buffer. That last part (an existing buffer) is crucial because we want to hand off memory ownership of the results to CuPy or PyTorch without incurring a copy or tying the lifetime of the results to a specific RMM memory resource. I don't see any obvious overlap here, but I also haven't read the source code of those APIs to know if there are common paths.
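The ownership distinction above can be sketched in a few lines (hypothetical names, NumPy host memory standing in for GPU memory): a pack-style API allocates and returns a new buffer, while a table_to_array-style API fills storage the caller already owns, so the result's lifetime is not tied to the producer's allocator.

```python
import numpy as np

def pack_style(columns):
    # Allocates its own output; the caller must adopt the allocation or copy it.
    return np.concatenate(columns)

def table_to_array_style(columns, out):
    # Writes into caller-owned storage: no new allocation, no ownership transfer.
    np.concatenate(columns, out=out)

cols = [np.array([1, 2]), np.array([3, 4])]
owned_by_caller = np.empty(4, dtype=cols[0].dtype)  # e.g. a CuPy/PyTorch tensor's buffer
table_to_array_style(cols, owned_by_caller)
```

This is also why the output buffer's allocator (CuPy's pool, PyTorch's caching allocator, or RMM) never needs to be visible to the copy itself.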

@vuule vuule requested a review from bdice April 25, 2025 19:20

@vuule vuule left a comment


A few places where the code could be simplified; otherwise looks great!

@Matt711 Matt711 moved this to In Progress in cuDF Python May 1, 2025
@Matt711 Matt711 requested review from lamarrr and vuule May 5, 2025 19:45

@vuule vuule left a comment


Thank you for addressing the comments!
A few more things and it should be good to go.

```cpp
 */
void table_to_array(table_view const& input,
                    device_span<cuda::std::byte> output,
                    cudf::data_type output_dtype,
```

Could we avoid passing this and instead derive the type from the column types, since we throw on mismatch anyway?


Done b71ec3a

@Matt711 Matt711 requested a review from vuule May 6, 2025 17:31

@vuule vuule left a comment


🔥


Matt711 commented May 7, 2025

Thanks all!


Matt711 commented May 7, 2025

/merge

@rapids-bot rapids-bot bot merged commit 7e555e0 into rapidsai:branch-25.06 May 7, 2025
112 of 113 checks passed
@github-project-automation github-project-automation bot moved this from In Progress to Done in cuDF Python May 7, 2025
@GregoryKimball GregoryKimball removed this from libcudf May 8, 2025
rapids-bot bot pushed a commit that referenced this pull request May 15, 2025
Contributes to #16483 by adding fast paths to `DataFrame.to_cupy` (which is called when `DataFrame.values` is accessed). This PR follows up #18450, adding Cython bindings for `cudf::table_to_array` to pylibcudf and plumbing those changes through cudf classic.

I benchmarked the fast (True) and slow (False) paths when the dataframe has 1, 6, 20, and 100 columns. The fast paths use `cudf::table_to_array` if the number of columns is greater than 1, and `cp.asarray` directly if the dataframe has only one column. The slow path uses a [raw Python loop + assignment](https://github.com/rapidsai/cudf/blob/35d58394e7fb5a090ff3cda351403ec092476af5/python/cudf/cudf/core/frame.py#L520) to create the CuPy array.
![image](https://github.com/user-attachments/assets/4c9edfa0-e15d-4902-b597-675cfb02343d)

I used the median because the CUDA overhead of calling `cudf::table_to_array` is large (so there are outliers in the times). Here is a profile of calling `to_cupy` twice for both the slow and fast paths. 
![Screenshot from 2025-05-13 12-23-46](https://github.com/user-attachments/assets/d84fdfa3-3696-4df8-91b6-3eb9dde65430)

In the first calls, the fast path takes 7.3 ms vs. 4.8 ms for the slow path; the first call to `cudf::table_to_array` is the bottleneck. But comparing the second calls, the fast path is much faster (79 us vs. 2.3 ms).
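The dispatch described above can be sketched as follows (a hedged illustration, not the cudf source: NumPy stands in for CuPy, `to_array_sketch` and `to_array_slow` are hypothetical names, and the fast path's per-column copies stand in for the single batched device copy).

```python
import numpy as np

def to_array_sketch(columns):
    num_rows, num_cols = len(columns[0]), len(columns)
    if num_cols == 1:
        # Single column: a direct view is enough (cp.asarray in the real code).
        return np.asarray(columns[0]).reshape(num_rows, 1)
    # Fast path: fill a preallocated column-major array with one contiguous
    # copy per column (one batched cub::DeviceMemcpy call in libcudf).
    out = np.empty((num_rows, num_cols), order="F")
    flat = out.reshape(-1, order="F")  # view over the same storage
    for j, col in enumerate(columns):
        flat[j * num_rows:(j + 1) * num_rows] = col
    return out

def to_array_slow(columns):
    # The old path: a Python loop assigning into each output column.
    out = np.empty((len(columns[0]), len(columns)))
    for j, col in enumerate(columns):
        out[:, j] = col
    return out

cols = [np.array([1.0, 2.0]), np.array([3.0, 4.0])]
fast = to_array_sketch(cols)
slow = to_array_slow(cols)
```

Both paths produce the same array; the win on the GPU comes from replacing many small kernel-launch/assignment steps with one batched copy once the first-call overhead is paid.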

Authors:
  - Matthew Murray (https://github.com/Matt711)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Matthew Roeschke (https://github.com/mroeschke)

URL: #18801
